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1.  Introduction 


Truncated  singular  value  decompositions  (tSVDs)  are  used  for  a  variety  of  tasks 
in  domains  ranging  from  dimension  reduction  using  principal  component  analysis 
(PCA)/  to  eigenfaces^  in  machine  learning,  to  latent  semantic  indexing  (LSI)^  in  in¬ 
formation  retrieval,  matrix  regularization,^’^  and  sundry  methods  in  signal  process¬ 
ing.^’^  It  is  the  core  operation  of  data  analysis  techniques  that  use  diagonal  matrix 
factorizations,  such  as  PCA^’^  and  proper  orthogonal  decomposition.^  In  all  cases, 
one  is  presented  with  M  points  ai,  ai, . . . ,  a,,,  embedded  in  R^.  These  are  assembled 
into  a  matrix  A  6  R^^^;  the  goal  is  to  find  a  reduced-dimension  approximation  A 
of  A,  where  A  e  R"^^,  with  n  N.  With  the  tSVD,  the  data  are  projected  into 
the  space  spanned  by  a  small  subset  of  singular  vectors;  these  are  the  n  singular 
vectors  that  have  the  n  largest  singular  values.  In  particular,  the  tSVD  provides  2 
key  advantages  for  dimension  reduction  applications: 

•  it  approximates  the  data  in  a  lower  dimension;  thereby  reducing  storage  and 
processing  costs  while  maintaining  important  features  of  the  data,  and 

•  it  exhibits  data  cleaning  properties  by  projecting  into  a  space  orthogonal  to 
dimensions  along  which  variance  is  relatively  small. 

The  latter  of  the  above  properties — data  cleaning — is  an  important  feature  of  tSVD 
methods  for  dimension  reduction  and  data  approximation.  Partitioning  R^  with  the 
tSVD  of  A  has  been  shown  to  separate  global  structure  of  columns  of  A  from  lo¬ 
cal  deviations  and  noise. Global  structure  is  represented  by  left  singular  vectors 
of  A  with  large  singular  values,  while  noise  is  represented  by  left  singular  vectors 
of  A  with  small  singular  values.  Important  dimension  reduction  methods  that  use 
the  tSVD  exhibit  better  performance  with  reduced  dimension  data  than  with  the 
original,  high-dimension  data;  for  example,  LSI  produces  reduced  dimension  ap¬ 
proximations  that  have  better  precision  and  recall  than  is  witnessed  with  the  same 
queries  on  the  original  data.^’^^ 

A  chief  drawback  of  tSVD  methods  is  that  they  are  computationally  expensive.  This 
drawback  has  led  several  authors  to  develop  approximations  to  the  tSVD  that  are 
computationally  cheaper.  Several  methods  that  substitute  a  Krylov  subspace  for  a 
truncated  singular  vector  space  have  been  proposed,  and  they  have  shown  great 
promise  for  reducing  computational  costs  while  only  yielding  small  differences  in 
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the  sum-of-squared  dimension  reduction  error  of  the  tSVD.  However,  analyses  of 
these  Krylov  subspace  methods  only  have  considered  the  sum-of-squares  approx¬ 
imation  error  difference  between  the  tSVD  and  a  Krylov  subspace  approximation. 
The  data  cleaning  properties  are  not  considered. 

It  is  somewhat  well  known  that  one  cannot  generate  a  sequence  of  Krylov  subspaces 
that  are  all  orthogonal  to  the  smallest  extremal  eigenvectors — ones  with  the  smallest 
eigenvalues,  no  matter  what  one  does  to  the  start  vector.  Bounds  in  Golub  et  al.^^ 
show  this  quantitatively;  in  fact,  if  one  has  a  random  start  vector,  then  a  Krylov 
subspace  used  as  a  tSVD  approximation  may  have  a  significant  overlap  with  the 
noise  space  (we  quantitatively  define  the  noise  space  in  Section  1.3  and  subspace 
overlap  in  Section  2.1)  after  a  small  number  of  iterations.  Without  filtering  the  start 
vector,  the  data  cleaning  properties  of  the  Krylov  subspace  approximation  methods 
are  poor.  The  presence  of  noise  destroys  the  important  data  cleaning  advantages  of 
low-rank  data  approximation. 

1 .1  Summary  of  Contents 

Krylov  subspace  tSVD  approximations will  almost  certainly  not  remove  noise 
as  well  as  the  tSVD  when  the  start  vector  of  the  Krylov  subspace  is  random.  Even 
when  the  start  vector  is  not  random,  if  the  start  vector  is  not  orthogonal  to  all  noise 
content,  then  noise  will  eventually  be  present  in  the  Krylov  subspace. 

It  is  well  known  that  the  principal  angles  between  a  Krylov  subspace  and  the  noise- 
space  vectors  will  shrink  as  one  grows  a  Krylov  subspace,  no  matter  how  close 
to  orthogonal  the  start  vector  is  to  the  noise  space.  However,  this  relationship  has 
not  been  well  studied;  there  is  no  theory  to  predict  how  “noisy”  a  Krylov  subspace 
approximation  of  the  tSVD  will  be.  Our  main  result  bounds  the  noise  content  of 
Krylov  subspaces  as  dimension  reduction  projections.  We  present  sufficient  con¬ 
ditions  that  guarantee  noise  filtering  of  the  Krylov  subspace  based  on  the  noise 
content  of  the  initial  vector. 

These  sufficient  conditions  allow  one  to  design  a  filtering  procedure  to  produce 
start  vectors  that  generate  Krylov  subspaces  with  bounded  noise  content.  One  may 
use  Krylov  subspace  tSVD  approximation  methods,  and  enjoy  the  noise  cleaning 
properties  of  the  tSVD  while  preserving  the  significant  computational  cost  savings 
of  the  Krylov  subspace  matrix  approximation  methods. 
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1 .2  Dimension  Reduction  and  the  tSVD 


The  exact  nature  of  the  data  approximation  and  analysis  problem  assumes  that  we 
have  data  embedded  in  an  //-dimensional  space;  that  is,  our  data  are  in  the  form 
of  vectors  a,  6  R.^  for  i  =  1,2, ...  ,M.  We  further  assume  that  all  the  data  a,  are 
assembled  into  a  matrix  A  =  [ai  a2  •  •  •  aM]-  Then,  all  the  methods  mentioned  in  the 
previous  section  obtain  dimension  reduction  and  data  cleaning  via  the  tSVD  of  A 
or  some  matrix  directly  derived  from  A.  The  chief  differences  between  PC  A,  LSI, 
eigenfaces,  proper  orthogonal  decomposition,^  and  the  like,  lie  in  the  derivation 
steps  applied  to  A  before  the  tSVD. 

We  now  proceed  to  formally  explain  the  SVD  and  define  the  tSVD.  We  assume 
without  any  loss  of  generality  that  N  >  M — otherwise  simply  replace  A  with  A^. 
Then,  any  N  x  M  matrix  A  can  be  factorized  as 

A  =  UEV^  (1) 


with  U  6  and  V  6  R.^^^  with  orthonormal  columns,  E  diagonal  and  having 
shape  M  X  M.  All  diagonal  elements  of  S  are  real  and  nonnegative.  Columns  of  U 
and  V  are  called  left  and  right  singular  vectors  of  A,  and  diagonal  elements  of  S  are 
called  singular  values  of  A.  We  write  cr,(A)  to  denote  a  diagonal  element  of  E,  and 
order  them  such  that  cri(A)  >  cr2(A)  >  •  •  •  >  cr^(A).  The  tuples  (M,-,v,-,cr,(A))  are 
singular  triplets  of  A.  We  call  a  singular  vector,  value,  or  triplet  leading  if  i  close  to 
1,  and  trailing  if  i  is  close  to  M. 


Definition  1 :  tSVD 

Let  A  =  UEV^  be  the  SVD  of  A.  Then  the  tSVD  of  A  is  given  by 


■  (n) 
^tSVD 


T 

n 


(2) 


where  U„  =  [ui  U2  •  •  •  u„],  V„  =  [vi  V2  •  •  •  v„],  and  =  diag(cri(A),  a-2(A), . . . , 
cr„(A)).  That  is,  U„  and  V„  are  composed  of  the  leading  n  left  and  right  singular 
vectors,  respectively,  and  is  composed  of  the  leading  n  singular  values. 

It  is  clear  from  the  definition  of  the  tSVD  that  it  is  a  projection  of  A  through  the 
n-dimensional  space  colspan  {U„}  formed  by  the  span  of  the  leading  left  singular 
vectors. 
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Also,  the  n-dimensional  tSVD  is  optimal  with  respect  to  the  Frobenius  norm,  so 

/V  (n)  ^ 

AtsvD  =  ||A  -  A\\f.  The  tSVD  produces  the  reduction  of  A  to  n- 

dimensions  that  is  optimal  in  the  least-squares  sense  aggregated  over  all  a,. 

1 .3  Signal  and  Noise  Spaces _ 

We  have  noted  that  truncating  the  SVD  also  can  de-noise  the  data.  We  define  signal 
and  noise  spaces  in  terms  of  the  SVD. 

Definition  2:  Signai  and  Noise  Space 

Suppose  A  is  an  arbitrary  real  matrix  with  SVD  A  =  UEV^.  Let  0  <  r  <  1  be  a 
nonnegative  real  number.  Then  the  noise  space  'l/noise  of  A  is  defined  as 

'Wnoise  =  span  jup,  Up+1, . . . ,  Um)  (3) 


where  p  is  the  smallest  natural  number  that  satisfies 


>  T. 


(4) 


The  signal  space  'Wsignai  of  A  is  then  defined  as  the  complement  of  'l/noise: 


signal 


‘noise 


(5) 


Remark  1.  When  the  mean  of  the  columns  a,  of  A  are  zero  centered  and  therefore 
=  0,  the  Gram  matrix  AA^  satisfies  AA^  =  5C,  where  5  is  some  scalar  and 
C  is  the  covariance  matrix  of  the  sample  a,.  In  this  case,  a  tSVD  of  A  is  equivalent 
to  projecting  out  the  N  -n  orthogonal  dimensions  along  which  variance  is  smallest. 


Thus,  the  noise  space  of  A  is  defined  as  the  space  in  which  less  than  t  of  the  Frobe¬ 
nius  norm  of  A  lies.  It  is  clear  that  as  r  ^  I,  p  approaches  that  index  of  the  smallest 
nonzero  singular  value.  This  illustrates  that  the  “noisiest”  singular  vectors  are  those 
with  the  smallest  singular  values. 

When  building  a  Krylov  subspace  for  low-rank  approximation,  we  want  to  guaran¬ 
tee  that  it  is  (nearly)  orthogonal  to  these  noisiest  singular  vectors  for  some  r  that  is 
close  to  1. 
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1 .4  Minimal  Krylov  Subspaces  for  Approximation  of  the  tSVD 


Though  the  tSVD  has  these  advantageous  properties,  it  can  be  expensive  to  com¬ 
pute,  especially  if  n  is  on  the  order  of  tens  or  more.  A  number  of  authors  have 
proposed  Krylov  subspaces  as  a  surrogate  for  the  leading  singular  vector  space  for 
dimension  reduction  tasks. Krylov  subspaces  are  defined  in  terms  of  square 
matrices.  When  A  is  not  square,  Krylov  subspaces  are  typically  defined  in  terms  of 
the  Gram  matrices  G  =  AA^  or  Gj  =  A^A.  These  2  Gram  matrices  transform  the 
singular  value  problem  into  an  equivalent  eigenvalue  problem  on  G  or  Gj;  given 
the  SVD  A  =  UEV^, 


G  =  AA^  = 

(6) 

Gt  =  A^A  =  VE^V^. 

(7) 

Hereafter,  when  we  write  T,-,  it  is  implied  that  T,-  =  /l,(G)  =  /1,(Gt)  =  cr,(A)^ 

Definition  3:  Krylov  Subspace 

Suppose  G  =  AA^  with  shape  N  x  N  and  6  R^.  Then  the  nth  Krylov  subspace 
is  given  by 


(G,  =  span  jz(°\  Gz(°\  G^z^^^  . . . ,  G^^'z^^^) .  (8) 

We  call  z^°^  the  start  vector  of  the  Krylov  subspace.  Approximation  error  of  a  sin¬ 
gular  vector  u,-  of  A  depends  on  the  angle  u,)  =  cos“^  /||z*^'^^|||iu,||, 

where  (•,  •)  is  the  inner  product,  and  on  the  distribution  of  singular  values  of  A.  The 
closer  z*^°^  is  to  u, — the  larger  cos  &  u,  j — the  better  the  approximation  of  u,  in 

7G(G,  When  no  beforehand  information  is  available,  is  typically  chosen 
to  be  random. 

Krylov  subspace  methods  have  been  well  proven  as  iterative  SVD  solvers. One 
projects  A  into  the  intersection  of  TC,  (G,  j  and  7C  (Gj,  Gz^°^  j,  and  each  iteration 
reduces  approximation  errors  of  extremal  singular  values.  In  fact,  the  tSVD  is  often 
computed  with  a  Krylov  subspace  solver.  The  difference  is  that  to  compute  an  n- 
dimensional  tSVD,  one  will  likely  need  to  generate  a  Krylov  subspace  TCj.  (G,  z^°^  j 
with  n,  and/or  repeatedly  select  a  better  start  vector  and  generate  a  new  Krylov 
subspace.  The  tSVD  approximation  methods'^^^^  instead  use  k  =  n:  a  minimal 
Krylov  subspace. 
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Definition  4:  Minimai  Kryiov  Subspace 

The  kih.  Krylov  subspace  (G,  j  is  said  to  be  minimal  for  a  reduction  to  n 
dimensions  iik  =  n. 

Remark  2.  In  typical  use  of  Krylov  subspaces,  one  generates  a  subspace  much 
larger  than  the  solution  space  that  is  needed.  The  wanted  solutions  are  then  extracted 
from  the  Krylov  subspace.  For  example,  if  one  wants  to  compute  n  eigenvalues,  one 
will  generate  (G,  j  with  k  >  n,  and  often  k^  n.  When  all  n  eigenvectors  are 
invariant  to  tolerance,  they  are  extracted  from  the  Krylov  subspace,  and  the  problem 
is  projected  into  the  space  spanned  by  those  n  computed  eigenvectors. 


1 .5  Approximate  Eigenvectors  and  Eigenvaiues  from  Kryiov  Subspaces _ 

An  orthonormal  basis  Zi,Z2, . . .  ,z„  of  approximate  eigenvectors  of  G  may  be  ex¬ 
tracted  from  a  Krylov  subspace  7C  (G,z'^‘’^j;  alternately,  these  are  approximate  left 
singular  vectors  of  A.  These  vectors  are  also  orthonormal  and  G-conjugate.  We  call 
an  approximate  eigenvector  z,  from  a  Krylov  subspace  a  Ritz  vector  and  the  value 
6i  =  zjGzi  a  Ritz  value.  Any  Ritz  vector  may  be  expressed  as  a  linear  combination 
of  eigenvectors  of  G. 


A  Ritz  vector  z,  is  not  necessarily  equal  to  the  projection  of  an  eigenvector  Q„Q^U;Ui 
(where  Q„  is  an  orthonormal  basis  for  (G,  z*^°^  j)  through  the  Krylov  subspace. 
That  is,  we  may — and  very  likely  will — have  z,  Q„Q^u,  for  all  i.  This  is  because 
z,  is  defined  as 


= 


min  arg  max 

dim(C)=(-l  x±C 


x^Gx 

W’ 


(9) 


while  QnQX  is  given  by 


QnQlm  =  arg  min  ||u;  -  x| 

xeW„(G,z(0)) 


(10) 


In  a  minimal  Krylov  subspace,  it  is  almost  certain  that  there  are  many  Ritz  vectors 
that  are  not  G-invariant;  that  is,  the  residual  ||Gv,  -  0,v,j|  is  greater  than  machine 
epsilon.  Ritz  vectors  from  Krylov  subspaces  are  defined  in  terms  of  a  polynomial 
q(x)  whose  roots  are  the  Ritz  values. Thus  q{6i)  =  0.  Noise  content  of  the  Ritz 
vector  depends  on  the  ratio  ,  where  p 

is  from  Definition  2.  Whenever  q(Aj)^  is  not  tiny  for  j  >  p,  then  the  Ritz 

vector  may  not  be  orthogonal  to  the  noise  space. 
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1.6  A  Motivating  Exampie 


Constructing  a  linear  classifier  is  a  task  that  can  motivate  our  discussion.  A  sim¬ 
ple  way  to  construct  a  linear  classifier  is  to  solve  Gx  =  /ij  -  /I2’  where  G  is  the 
covariance  matrix  and  and  1X2  are  the  means  of  the  2  classes.  This  problem  is 
ill-posed  when  G  has  small  singular  values  (a  nontrivial  noise  space),  and  we  want 
a  classifier  that  is  orthogonal  to  those  dimensions  along  which  variance  is  small. 

The  example  is  a  matrix  regularization  problem.  In  matrix  regularization  prob¬ 
lems, one  has  a  matrix  G  that  has  small  singular  values  and  one  seeks  a  solu¬ 
tion  that  minimizes  ||Gx  -  b||  but  is  also  orthogonal  to  the  singular  vector  space 
corresponding  to  the  small  singular  vectors  of  G.  The  small  singular  values  of  G 
make  minimization  of  ||Gx  -  b||  ill-posed,  as  small  perturbations  to  b  may  result 
in  a  large  perturbation  of  x.  One  instead  minimizes  a  regularized  problem  such  as 
||Gx  -  b||  -I-  ?7l|x||,  where  77  is  a  user-chosen  regularization  parameter  picked  to  avoid 
small  singular  vectors  of  G.  Using  a  tSVD  can  be  as  effective  as  directly  minimizing 
||Gx  -  b|i  -I-  77lix||  for  optimal 

If  one  substitutes  a  minimal  Krylov  subspace  (G,  j  with  a  random  for 
the  truncated  singular  vector  space,  then  the  influence  of  small  singular  values  is 
difficult  to  control  either  directly,  as  with  the  tSVD,  or  indirectly,  as  in  explicit 

^  (rt) 

minimization  of  |iGx  -  b|i  -l-  ?7l|x||.  An  x  computed  with  an  A  from  a  minimal 
Krylov  subspace  may  have  a  large  ||x||,  which  would  have  been  avoided  with  even 
a  small  77. 

Example  1.  Let  G  be  a  2, 000  x  2, 000  diagonal  matrix  defined  as 

G  =  diag(l,  1/2, 1/3, 1/4, . . . ,  1/1999,  lO-i’).  (1 1) 

Since  G  is  diagonal,  its  diagonal  entries  are  its  eigenvalues.  Moreover,  since  all 
its  eigenvalues  are  nonnegative,  its  spectral  decomposition  and  SVD  coincide.  The 
spectrum  of  G  is  shown  in  Fig.  1. 

Set  =  1  /  V2OOO  2?^?°  Ui-  We  have  cos  &  u,)  =  1  /  V2OOO  for  all  eigenvec¬ 

tors  u,.  We  compute  an  orthonormal  basis  Q„  for  (G,  and  the  G  =  Q^GQ 
for  1  <  77  <  20.  We  generate  a  random  b  and  solve  the  least  squares  problem 
X  =  argminy||T„,„y-b||. 
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eigenvalue  index 


Fig.  1  Spectrum  of  G  as  defined  in  Eq.  11 


/V  —  1  /V  T 

We  compute  the  Frobenius  norm  of  G  where  G  =  q:gq  for  the  Krylov  subspace 
solution  or  G  =  for  the  tSVD  solution;  the  values  are  shown  in  Fig.  2. 

Values  of  ||x||  are  also  shown  for  both  the  tSVD  and  minimal  Krylov  subspace  x. 
Figures  1  and  2  show  that  substituting  a  minimal  Krylov  subspace  for  a  truncated 
singular  vector  space  for  matrix  regularization  produces  poorer  regularization  re- 
suits.  The  Frobenius  norms  of  the  G  from  the  minimal  Krylov  subspace  are  at 

/V  —  1 

least  10  times  larger  than  the  tSVD  G  ,  and  the  ||x||  are  at  least  100  times  larger  for 
the  minimal  Krylov  subspace. 


Krylov  subspace  dimension  Krylov  subspace  dimension 

Fig.  2  Frobenius  norms  of  G  *,  where  G  is  G  restricted  to  “TCi^G, or  restricted  to  the 
truncated  singular  vector  space  span {m, . . . , u„)  (left).  Norms  ||x||  where  x  is  a  solution  to  the 
least  squares  problem  ||Gx  -  b||  and  where  G  =  Q^GQ„  where  Q„  is  an  orthonormal  basis  for 
“TCi  (G,  or  G  =  (right).  Large  values  indicate  greater  influence  of  small  singular 

vectors  in  x  and  more  sensitivity  to  small  perturbations  in  b. 
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2.  Corruption  of  Subspaces  with  Noise 


When  the  basis  veetors  of  the  Krylov  subspace  are  not  orthogonal  to  the  noise  space, 
then  the  Krylov  subspace  has  been  “corrupted”  by  noise.  The  proceeding  analysis 
requires  a  measurement  of  how  much  a  subspace  is  corrupted  with  noise.  It  is  ev¬ 
ident  that  measurement  of  the  corruption  of  a  space  S  by  noise  is  equivalent  to 
measuring  the  norms  of  images  of  noise  space  basis  vectors  Uj  projected  into  S .  We 
use  the  principal  angles  between  spaces  to  formalize  this  concept. 

2.1  Principal  Angles  for  Quantifying  Subspace  Overlap 

Much  of  the  proceeding  analysis  considers  the  principal  angles  between  spaces  (see 
Zhu  and  Knyazev^^Definition  2.1  and  Theorem  2.1]).  We  use  these  to  quantify  the 
overlap  between  2  subspaces  of  R^.  In  the  context  of  our  analysis  of  minimal  Krylov 
subspaces  for  approximating  the  tSVD,  we  would  like  to  have  the  overlap  between 
'K„(G,z«^>)  and'Wnoise  as  small  as  possible. 

Definition  5:  Principal  Angles 

Let  X  and  Y  be  matrices  in  R^  with  orthonormal  columns.  Then  the  principal  angles 
between  colspan  {X}  and  colspan  {Y}  are  defined  as 

§{X,Y)  =  [cos“^(cri(X^Y))  cos“^(cr2(X^Y))  •  •  •]  =  cos“Vo'(X^Y)).  (12) 


When  either  X  or  Y  has  only  one  column,  then  there  is  only  one  principal  angle. 
There  is  also  a  close  relationship  among  the  principal  angles  Y),  the  Frobe- 
nius  norm  ||X^Y||f ,  and  the  spectral  norm  ||X^Y||2.  The  spectral  norm  is  ||X^Y||2  = 
Vcos  (X,  Y)^  =  cri(X^Y)  and  the  Frobenius  norm  is  ||X^Y||f  =  -^2^i~cos^i"(X/Yy 

■^Z/=i  o"((X'''Y)2  where  X^Y  has  k  nonzero  singular  values. 

Clearly,  when  colspan  {X}  ±  colspan  {Y},  then  ??,(X,  Y)  =  njl  for  all  valid  i,  and 
||X^Y|i;7  =  ||X^Y||2  =  0.  The  closer  all  principal  angles  are  to  nj2,  the  smaller  the 
overlap  between  colspan  {X}  and  colspan  {Y}. 
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2.2  A  Measure  of  Corruption:  p-Free  of  Noise 


Principal  angles  and  the  elosely  related  matrix  norms  ||  •  ||f  and  ||  •  \\2  lead  naturally 
to  a  measure  of  the  overlap  between  the  noise  spaee  'l/noise  and  some  subspaee  S 
with  orthonormal  basis  W:  p-free  of  noise. 

Definition  6:  p-Free  of  Noise 

Let  <S  c  be  some  subspaee,  let  W  be  an  orthonormal  basis  for  S,  let  'Z/noise  c  R^ 
be  the  noise  spaee  of  G,  and  Unoise  be  an  orthonormal  basis  for  'Z/noise-  Pick  some 
nonnegative  real  p.  Then  S  is  p-free  of  noise  if 

eos  (Unoise,  W)  <  p.  (13) 

An  equivalent  eondition  to  Eq.  13  is  that  l|Uj[j,;j.gW||2  <  p.  Sinee  l|Uj[gij,gW||2  < 
liuL,  ,wiif,  liuL,  ,wiif  <  p  also  implies  that  S  is  p-free  of  noise. 

3.  Two  Sufficient  Conditions  on  for  TCi  (g,  to  Be  p-Free  of  Noise 

We  are  now  ready  to  present  our  main  result.  We  develop  2  eriteria  that  guarantee 
that  the  nth  Krylov  subspaee  is  p-free  of  noise. 

3.1  A  Basic  Sufficient  Condition  on  for  an  Uncorrupted  Subspace 

Our  basic  sufficient  condition  comes  from  Corollary  1,  which  depends  on  Lemma  1 
and  Theorem  1.  First,  we  use  the  Lanczos  recurrence  (see  Saad,^^  Section  3.2)  to 
bound  the  cosine  cos  j?(q„+i,u,)  in  Lemma  1,  where  is  a  basis  vector  gener¬ 
ated  by  the  Lanczos  algorithm. This  result  then  leads  to  a  recurrence  relation  that 
bounds  the  image  of  the  eigenvector  U;  projected  into  TC  (G,  in  Theorem  1; 
the  sufficient  condition  in  Corollary  1  on  follows  from  that. 

We  begin  with  bounding  the  cosine  cos  &  (q„+i ,  u,). 

Lemma  1 

Let  q„^i  be  the  n  -i-  1th  basis  vector  generated  by  the  Lanczos  algorithm  acting 
on  a  Gram  matrix  G  and  z^°k  Let  Q„  be  an  orthonormal  basis  for  7C  (G,  Let 
=  Q^GQ„  be  the  restriction  of  G  to  TC  (G,  z*^®^ j,  and  =  QJu,.  Note  that 
T„  „  is  a  tridiagonal  matrix. Let  be  the  norm  of  the  residual  of  the  Lanczos 
algorithm  after  the  nth  step.  Order  the  singular  values  of  a  matrix  A  as  cri(A)  > 
cr2(A)  >  ■  •  •  >  cr„(A),  and  the  eigenvalues  of  G  as  di  >  /I2  >  •  ■  ■  >  /1a?.  Then 

Approved  for  publio  release;  distribution  is  unlimited. 


10 


cos  ??(u;,q„+i)  obeys 


cosi^(u,-,q„+i)  < 


A 


'n+l 


Proof.  Let  Q„  be  the  orthonormal  basis  generated  for  (G,  j  by  the  Lanczos 
algorithm  after  n  steps.  Then  we  have 

GQ„  =  Q„T„^„  +  r„e^. 

Write  the  inner  product  on  R.^  as  (•,  •).  Left-multiplying  both  sides  by  uj  gives 

+  ujr„e^ 

ujGQ„  -  =  ulr^el 


and 

/3n+i(Ui,  q„+i)e^  =  uJGQ„  -  uJQ„T„,„ 


as  both  q„+iyS„+i  =  r„.  Then 


^„+il<u,-,q„^i)|  =^„+ilKu,-,q„^i)e^||  =  ||u[GQ„  -  u[Q„T 


and 


Ku,-,q„+i)|  = 


||ujGQ„-ujQ„T„ 


lid,-u 


(n)T 


U 


(n)Tr, 


■-  n,n\ 


A 


n+l 


A 


n+l 


as  J3n+i  >  0.  Noting  that  cos  q„^i)  =  |(u,-,  q„^i)|  when  both  vectors  are  unit- 
length  and 


-  T„,„)||  <  ||u^"1|(ri(T„,„  -  U) 


(m)| 


(14) 


completes  the  proof.  □ 

Remark  3.  Eq.  14  may  also  be  bounded  from  below  as 


I  -  (n)T 

U 


(Id,- 


> 


(^n(^n,n  Id,-). 


One  could  use  this  result  to  obtain  a  different  necessary  condition  for  a  noise-free 
Krylov  subspace. 
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We  now  use  the  result  of  Lemma  1  to  bound  ||u|”^||.  This  will  result  in  a  basic  suffi¬ 
cient  condition. 


Theorem  1 

Let  G,  the  d,,  and  „  be  defined  as  in  Lemma  1.  Let  be  any  of  the  sub- 
and  super-diagonal  values  of  and  let  ^  for  A:  =  1, 2, . . . ,  n.  Let  be  the 
projection  of  eigenvector  u,-  into  Suppose  that  ||u|°^||  <  s  and  <  01, 

where  6]  is  the  principal  eigenvalue  of  T„  Then  the  norm  of  is  bounded  as 


\ 


Proof.  Let  9i  be  the  principal  eigenvalue  of  T„„.  We  have  0  <  6i  <  Ai — G  is  a 
Gram  matrix  and  0  <  An  <  0i — and  it  is  assumed  that  d,  <  6i .  Then  we  get 


f  \(^k,k  -  Id,)  <  di  -  d;. 


Applying  Lemma  1  gives 


cos??(u,',q„+i)  < 


'n+\ 


l|fi!"^ll(Ti-d,-) 


< 


Then  we  can  express  ||u^”^||  recursively,  as 


'uS”^|p  =  cosy^(u,-,q„)^  -t  ||u' 


2  ,  I 1) 1 12 

i  II  * 


So 


u^^)||2(dl  -  d,-)^ 


<  liuf' 


-i),2i(Ai-A,f 


+  1  . 


The  closed  form  for  this  series  is 


u 


(«)||2 


U' 


(0)||2 


/(Ti-d,)2 

I 
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This  result  immediately  leads  to  a  suffieient  eondition  on  for  (G,  j  to  be 
p-free  of  noise. 

Corollary  1 

Let  G  and  z^^^  be  given  where  eigenvalues  of  G  are  Ti  >  /I2  >  •  •  •  >  Xn,  and  let 
N  -  pht  the  dimension  of  the  noise  space  'l/noise-  Suppose  that  ||uf^||  <  s  for  any 
noise  eigenvector  u,-.  Then  (G,  z*^®^  j  is  p-free  of  noise  if 

+  <p\  (15) 


This  is  due  to  an  application  of  Theorem  1  to  bound  the  quantity  ||uj^^||,  and  noting 
that  ||u^Q„||  <  IIUj^oiseQlb-  This  application  of  Theorem  1  is  always  possible,  since 
/Ift?  is  always  less  than  or  equal  to  any  eigenvalue  of  T„  „ — the  assumption  that  /Ia?  < 
6\  is  always  valid. 

Remark  4.  Corollary  1  uses  a  lower  bound^  on  the  Lanczos  residuals  f3j.  This  value 
is  not  known  a  priori,  but  it  was  conjectured  g^^  that  no  /3j  becomes  negligible.  The 
cases  for  which  (Sj  does  attain  a  small  value  is  when  the  Krylov  subspace  is  invariant 
or  nearly  so.  We  have  observed  that  the  median  eigenvalue  is  often  a  reasonable 
lower  bound  on  the  from  the  Lanczos  algorithm. 

We  now  proceed  with  an  example  in  which  the  median  eigenvalue  is  a  reasonable 
lower  bound  for^. 

Example  2.  We  continue  with  the  matrix  G  defined  in  Example  1.  Set  p  =  0.001. 
Since  'l/noise  is  defined  by  the  trailing  100  eigenvectors,  N  -  p  =  100.  We  transform 
Eq.  15  to  a  condition  on  s: 

s  < 
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We  supposed  that^  is  greater  than  the  median  eigenvalue  diooo-  We  then  compute 
the  maximum  e^ax  that  will  satisfy  Eq.  16  for  1  <  n  <  10. 

To  show  the  pessimism  of  these  Smaxj  we  also  used  line  search  on  a  posteriori 
measured  values  of  cos('l/noise5  Qn)>  where  Q„  is  a  basis  for  *7C,  (G,  to  find  a 
value  Scomputed:  a  maximum  observed  value  where  cos  &  (uy,  j  <  Scomputed  implies 
is  p-free  of  noise  and  where  Uy  is  a  noise  eigenvector.  We  computed 
^computed  to  15  Significant  digits.  The  values  for  e^ax  and  Scomputed  are  shown  in  the 
left-hand  plot  of  Fig.  3. 

The  purpose  of  this  example  is  to  verify  the  values  of  emax  and  Scomputed-  We  com¬ 
puted  bases  Q„  for  TC  (G,z^°^)  for  1  <  n  <  10,  and  used  the  computation  to  verify 
that^  >  diooo-  We  set 


,(0) 


1  -  (100e2) 


1900 


2000 


1900 


^u,  +  e  ^  u,-, 


(=i 


(=1901 


(17) 


where  s  is  either  Smax — defined  by  equality  in  Eq.  16,  or  e  is  ecomputed-  We  then  com¬ 
puted  an  orthonormal  basis  for  TC  (G,  and  measured  ||Uj[Q;j,gQ„||.  The  cos  &  U/j 
s  for  any  noise  eigenvector  u,-.  The  values  of  ||Uj[p;j,gQ„||2  for  both  e^ax  and  ecomputed 
are  shown  in  the  right  plot  of  Fig.  3. 


Fig.  3  Values  of  Smax  and  ecomputed  (left).  Values  of  where  Q„  is  a  basis  for  TC,  (g, 

and  is  defined  as  in  Eq.  17  for  either  Cmax  or  ecomputed  (right).  The  small  values  of  ||UEj,gQ,J|2 
for  emax  indicate  that  Corollary  1  is  pessimistic  for  this  example. 


We  notice  that  Corollary  1  is  pessimistic.  Satisfying  Eq.  15  may  be  difficult,  since 
((/li  -  l^y  may  grow  rapidly.  Even  for  small  values  of  n,  producing  a  suffi¬ 
ciently  small  cos  Uyj  may  be  impossible  in  finite  precision.  Even  when  Eq.  15 
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is  not  satisfied,  it  may  be  the  case  that  the  Krylov  subspace  is  p-free  of  noise.  There¬ 
fore,  we  present  a  tighter  a  posteriori  sufficient  condition  that  uses  information  from 
the  Krylov  subspace. 

3.2  A  Sharper  Sufficient  Condition  on 

We  notice  that  the  basic  sufficient  condition  is  pessimistic,  and  not  practical  for 
n  greater  than  10  or  so.  Our  second  sufficient  condition  gives  us  extra  sharpness 
to  extend  a  sufficient  condition  for  larger  n.  Lemma  2,  about  the  polynomial  that 
defines  Ritz  vectors,  directly  gives  another  sufficient  condition,  which  we  illustrate 
in  Example  3.  A  nice  feature  of  the  condition  that  comes  from  Lemma  2  is  that  the 
quantities  can  be  computed  from  byproducts  of  the  Lanczos  algorithm,  as  is  noted 
in  Remark  5. 

Lemma  2 

Let  the  positive  semidefinite  matrix  G  and  vector  define  the  Krylov  subspace 
’Kn  (G,  j,  and  lei  9i  >  62  >  ■  ■  ■  >  On  be  the  Ritz  values  of  the  restriction  of  G  to 
(G,  with  unit-length  Ritz  vectors  Zi,  Z2, . . . ,  z„.  Define  the  polynomial  qj{x) 
as 


=  n 

t=  1 

kit  j 


and  set  Ci  =  where  (•,  •)  is  the  inner  product  on  R^.  Suppose  p  <  i  <  N. 

Then  the  magnitude  of  the  cosine  cos  ■&  (u,,  Zyj  is  bounded  as 


cos 


max 

^<x<Ap 


\qj{x)ci\ 

lk/(G)z(0)|r 


Proof.  The  polynomial 


=  n 

t=  1 

k*  j 
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gives  the  7* 


Ritz  veetor  of  (G,  ( 

eosi^(u;,zy)  = 


Zj  =  qy(G)z*^°Vllq/G)z^°^||j.  Therefore, 

\qMi)ci\ 

^  lkXG)z(o)|r 


Sinee  i  <  p  and  G  is  positive  semidefinite,  0  <  d,-  <  Ap  and  |<2'/d,)|  <  maxo<;c<i^  i<?(-r)|. 
Then 


eos 


\qj(x)ci\ 

0<x<Ap  II^XG)z(0)||’ 


whieh  eompletes  the  proof.  □ 

Remark  5.  The  polynomial  qj(x)  and  the  norm  ||^y(G)z*^‘’^||  ean  be  eomputed  a  pos¬ 
teriori  inexpensively  as  a  by-produet  of  a  standard  Krylov  subspaee  algorithm,  sueh 
as  the  Lanezos  algorithm.  However,  c,-  is  typieally  unknown,  but  ean  be  bounded 
if  has  known  properties.  For  example,  if  z^°^  is  a  random  veetor,  then  |c,|  may 
be  probabilistieally  bounded  with  enough  tightness  as  to  result  in  tight  bounds  for 
eos  &  (u,,  z).  For  our  example  ease,  where  eigenveetors  are  all  standard  basis  veetors 
(in  Dettman,^"^  p.  Ill)  and  (u,, x)  =  x,  for  x,  is  the  ith  entry  of  x,  this  is  straightfor¬ 
ward. 


Example  3.  Let  G  be  defined  as  in  Example  1.  Set  z*^°^  to  be  a  random  veetor  with 
entries  drawn  from  the  normal  distribution  NiO,  1).  We  eompute  TCe  (G,  j  and 
eompute  the  Ritz  values  and  qj(x)  for  1  <  j  <  6.  Eaeh  qj(x)  is  a  quintie  polynomial; 
their  derivatives  give  their  maxima  between  Ritz  values  0,-,  are  quartie,  and  ean  be 
solved  analytieally.  We  eomputed  6(,  >  An  and  maxo<;c<i^,  \qj(x)\  <  maxo<x<e6  \qj(x)\, 
so  it  is  suflieient  to  eonsider  maxima  of  \qj(x)\  for  0  <  x  <  6(,.  Eor  all  qjix),  i^/O)!  = 
maxo<x<ip  \qj(x)\.  We  now  proeeed  to  bound  c,-. 

Sinee  z*^°^  is  random,  zero-eentered,  and  normally  distributed,  we  may  plaee  an 
upper  bound  on  eos  (z^°\  Uy j  by  notieing  that  the  standard  basis  veetors^"^  ej  = 
[0  0  •  •  •  0  1  0  •  •  •  0]  are  eigenveetors  of  G.  Then  and 

_  _  zf 

“  ||z(0)||  “ 

where  is  the  j’th  entry  of  z^°\  As  entries  of  z*^°^  are  drawn  from  N(0, 1),  the 
squared  norm  of  z^^^  follows  a  Chi-squared  distribution  with  N  -  I  degrees  of  free¬ 
dom.  Write  C/st(o,i)(a)  as  the  eritieal  value  N(0, 1)  and  probability  a,  and  C^2(a,N) 
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is  the  critical  value  fox Xn  probability  a.  Then,  with  probability  1  -  2a, 

Ca'(0,1)(1  -  «)  ^  ^  Cyv(0,l)(<2) 

7^2(1  -  a,  iV-  1)  “  “  ^C^2{\  -  a,N  -  1) 

Due  to  the  symmetry  of  N(0, 1),  this  is  equivalent  to 

,  I  Cm(q,i){ci) 

\Ci\  <  — 

^C^2(a,N-l) 

For  a  =  0.99  and  N  =  2000,  we  have  CA^(o,i)(a)  ~  2.33  and  C^2(l  -  a,N  -  1)  ~ 
1854.86.  Then  |c,|  <  0.0542  with  probability  at  least  0.99. 

We  combine  this  upper  bound  on  |c,|  with  the  computed  maxima  of  ||^/.r)||  over 
[0,  Tp]  and  the  values  of  |<?y(G)z'^°^|  to  get  upper  bounds  on  cos  The  re¬ 

sults  are  shown  in  Table  1,  and  we  compare  these  with  the  computed  values  for 
maxu^g-i/noise The  upper  bounds  on  cosi^(u;,Zyj  are  tighter  than  those 
produced  from  Corollary  1.  Also,  it  is  clear  from  Table  1  that  the  space  span  {z2,  Zi} 
is  p-free  of  noise  for  all  p  >  0.0007. 


Table  1  Computed  values  for  maxo<x<4  \qj{x)\,  ||^j(G)z*°^||,  probabilistic  upper  bounds  on 
cost?(ui,z;),  and  computed  cos??(ui,Zj)  for  fho  Ritz  vectors  zj  from  TCe  (G, 

Small  values  of  ||^j(G)z*°*||  contribute  substantially  to  large  upper  bounds  on  cos  )?(u;,  z,). 


Quantity 

Z6 

Z5 

Z4 

Z3 

Z2 

Zl 

max  \qi(x)\ 

0<x<6f, 

8.05x10-^ 

3.55x10-5 

8.32x10-5 

4.40x10-5 

2.56x10-5 

1.28x10-5 

lk;(G)z(‘*)|| 

7.4x10-^ 

7.8x10-5 

4.19x10-5 

4.68x10-5 

2.1x10-^ 

5.56x10-5 

probabilistic  max 
cos  7?  (uj,  z,  j 

Ci  bounded 
with  a  =  0.99 

5.87x10^2 

2.46x10-2 

1.07x10-2 

5.09x10-5 

6.69x10-* 

1.25x10-5 

computed  max 
cos  7?  (uy,  Z;  j 

4.02x10^2 

1.16x10-2 

5x10-5 

2.37x10-5 

3.11x10-* 

5.78x10-5 

The  bounds  in  Lemma  2  indicate  that  noise  may  be  due  to  either  a  large  value  of 
max()<x<ip  |p/.r)c,|  or  due  to  a  relatively  small  value  of  |ipj(G)z*^°^||. 
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4.  Conclusion 


We  have  presented  sufficient  conditions  for  a  Krylov  subspace  approximation  of  the 
tSVD  to  be  p-free  of  noise.  Generally  speaking,  one  can  see  that  minimal  Krylov 
subspace  substitutions  for  the  tSVD  are  doomed  to  be  noisy  if  n  is  large  enough, 
even  if  the  start  vector  is  orthogonal  to  the  noise  space  up  to  machine  precision. 
However,  for  moderate  n,  one  can  use  the  sufficient  conditions  to  design  a  filter  to 
produce  a  start  vector  that  has  a  small  enough  cos  for  noise  space 

U;  so  that  TCj  (G,  z^°^  j  is  p-free  of  noise.  We  are  then  motivated  to  find  methods  to 
compute  good  start  vectors  for  minimal  Krylov  subspaces. 

When  considering  methods  to  prepare  start  vectors  that  satisfy  the  sufficient  con¬ 
ditions  presented  here,  we  recall  the  overarching  purpose  for  minimal  Krylov  sub¬ 
space  approximations  to  the  tSVD:  dramatically  smaller  compute  times.  Start  vec¬ 
tor  generation  methods  that  have  smaller  computational  costs  are  preferable.  Start 
vectors  may  be  implicitly  filtered  with  either  Implicitly-Restart  Lanczos^^  or  Thick- 
Restart  Lanczos^^  when  n  is  small,  and  Lemma  2  gives  a  criterion  for  which  Ritz 
vectors  to  discard.  When  n  is  large,  the  asymptotic  cost  of  computing  Ritz  vectors — 
0{n^N) — may  become  prohibitive.  Implicit  start  vector  filtering  may  also  be  less  at¬ 
tractive  when  matrix-vector  products  scale  better  than  dot  products  or  matrix  norms, 
as  is  the  case  for  some  classes  of  distributed  matrices.  Then  filtering  methods,  such 
as  Chebyshev  polynomials  or  approximation  of  sigmoidal  functions^’  may  be  less 
expensive.  Since  G  is  a  positive  semi-definite  matrix,  simple  power  iteration  on  a 
block  vector  as  in  Halko  et  al.^^  may  also  be  an  effective  method  for  preparing  a 
start  vector  that  satisfies  our  sufficient  conditions. 
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